fix: disable HTTP/2, raise broadcast timeout to 30s, restore bounded retry for idempotent reads#257
Merged
Merged
Conversation
- Force HTTP/1.1 by setting NextProtos to ["http/1.1"] and ForceAttemptHTTP2=false to prevent HTTP/2 multiplexing - HTTP/2 multiplexes all requests over a single TCP connection, which causes all traffic to hit one ALB target behind the scenes - Set MaxConnsPerHost=20 to encourage connection rotation across multiple backend instances - This should improve load distribution to steemd instances and reduce broadcast_transaction_synchronous failures caused by uneven load on a single steemd backend
Extracts upstream timeout resolution into selectUpstreamTimeout(req) so the policy is explicit and testable: - defaultUpstreamTimeout (used when the per-URN config sets timeout=0) raised from 3s to 15s to match the production upstream config. - broadcastMinimumTimeout introduced at 15s: any broadcast_transaction* request always gets at least 15s regardless of config. A 3s clip on a synchronous broadcast surfaces to wallets as transaction_expiration_exception once the caller retries with a fresh expiration, even though the original tx already made it into a block. The existing callHTTPUpstream now delegates to selectUpstreamTimeout instead of inlining the timeout-or-default branch. Test coverage: - Table-driven selectUpstreamTimeout cases for non-broadcast / broadcast, config 0 / sub-minimum / above-minimum, both broadcast methods. - Safety assertion that both constants stay above one block-confirmation window (10s) so future refactors can't silently shrink them.
Three orthogonal improvements that together close the gap between the Go rewrite and the legacy Python jussi behaviour observed by wallets during upstream steemd incidents: 1. Selective upstream retry (internal/upstream/retry.go, HTTPClient.RequestWithRetry, processor.callHTTPUpstream) Commit 9cf36ea removed all upstream retry to fix a different bug (retry-induced expiration), but this also removed legacy's implicit tolerance for transient ALB/steemd blips on read paths. Idempotent methods now retry up to 2 attempts with 100ms→500ms backoff + 25% jitter on retriable failures (connection reset, EOF, i/o timeout, HTTP 5xx). Non-retriable errors (context cancel/deadline, 4xx) short-circuit immediately. broadcast_transaction* methods are still single-attempt — a retry could submit the same tx twice. The retry budget is intentionally tighter than legacy (≤2 vs 3 attempts, ≤500ms vs 5s backoff) so the worst-case retry tail stays well under typical wallet timeouts, avoiding the issue 9cf36ea fixed. 2. Broadcast minimum timeout 15s → 30s (internal/handlers/processor.go const broadcastMinimumTimeout) Production broadcast_synchronous traces show p95 wall time of ~3s (one block confirmation), but ALB-fronted steemd nodes can hang ≥15s during fork/restart windows. The 15s floor was clipping these recoverable cases. 30s covers ~10 blocks of upstream slowness while still bounding wallet response time. Production upstream config in the orchestration repo is updated in a parallel commit. 3. Observability for upstream config and broadcast errors (Router.logConfigSummary, selectUpstreamTimeout throttled log, processor.ProcessRequest broadcast-error structured log) - At startup each upstream emits one INFO line with its parsed URL count, translate_to_appbase flag, namespace timeout, and broadcast timeout — operators can verify "the config I shipped is the config that's running" in one glance. - selectUpstreamTimeout emits a throttled (1/method/minute) INFO log when the broadcast floor actually overrides a smaller configured value, surfacing config drift without log-spam. - Broadcast upstream errors emit a structured WARN log with method, namespace, upstream URL, jussi_request_id, and the upstream error message — the signal operators need to correlate "wallets seeing failed broadcasts" with an upstream incident, without paging through individual jaeger traces. Test coverage - internal/upstream/retry_test.go: IsRetriableUpstreamError truth table (15 cases incl. context cancel, wrapped errors, 5xx vs 4xx), backoff exponential-with-cap, jitter bounded-and-positive, default config safety assertion (worst-case budget ≤ 2s). - internal/upstream/http_retry_test.go: httptest-backed RequestWithRetry covering success-after-transient-500, no-retry-on-4xx, give-up-after-max-attempts, context-cancellation-stops-loop. - internal/handlers/processor_timeout_test.go: extended for the 30s floor (sub-minimum, equal-to-minimum, above-minimum, previous-15s raised to new 30s).
…log throttle - Replace string-matching 5xx detection with UpstreamStatusError type - Switch broadcast method detection from map to prefix match (broadcast_*) - Make shouldLogBroadcastFloor concurrency-safe with CompareAndSwap/LoadOrStore - Add nil-guard for req.URN in selectUpstreamTimeout - Expand tests for edge cases and concurrent log throttling
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Why
Wallets reported "broadcast timeout" after swapping traffic from legacy Python jussi to the Go rewrite. Triage via jaeger + watchtower revealed three independent gaps, addressed by this PR.
What this PR does
1.
disable HTTP/2 and add connection rotation for upstream clients(existing commit04b14b8)Force HTTP/1.1 for upstream clients, cap
MaxConnsPerHost, so the connection pool actually distributes across ALB backend instances instead of pinning to one.2.
enforce 15s minimum upstream timeout for broadcast methods(6172593)Extracts
selectUpstreamTimeout(req)as a pure function. RaisesdefaultUpstreamTimeout3s → 15s and introducesbroadcastMinimumTimeout = 15s(raised again in commit 3) so a forgotten or zeroed config entry can no longer clip a synchronous broadcast.3.
selective retry, 30s broadcast timeout, upstream observability(31a5ecb)Three orthogonal improvements:
HTTPClient.RequestWithRetry+IsRetriableUpstreamError.broadcast_transaction*remains single-attempt — a retry could submit the same tx twice. The retry budget is intentionally tighter than legacy (3 attempts/multi-second backoff) so it cannot reintroduce the expiration-on-retry pattern that motivated9cf36ea.orchestrationPR updates the production config to match.selectUpstreamTimeoutactually floors a sub-minimum config to the broadcast minimum, surfacing config drift without log-spam.method/namespace/upstream_url/jussi_request_id/upstream_message.Diagnostic evidence
upstream returned error: missing required active authority— not a timeout. Aggregating 500 traces in a 30-min window revealed 67% of broadcast errors weretransaction_expiration_exception, all withexpiration - jussi_entry_time = -80 to -96 minutes— i.e. wallets had signed those tx ~1.5 h earlier when the upstream was returning a stalehead_block_time. Companionget_dynamic_global_propertiestraces showed the upstream actively returningBad Cast: object_type to Arrayduring that window.broadcast_transaction_synchronousOK p95 = 3.07 s, max = 3.27 s. Sync broadcasts routinely sit just under the 3 s ceiling because they wait one block confirmation — under the previous 3 s default, any upstream slowness clipped them. The 30 s floor protects against that.Test plan
go test ./... -race -count=1 -short— full repo PASSgo vet ./...— cleango build ./...— cleanTestSelectUpstreamTimeout8 + 1 cases (incl. 30 s floor sub/equal/above)TestIsRetriableUpstreamError18 cases (incl. wrapped context errors, 5xx vs 4xx)TestRetryConfigBackoff*(exponential cap + jitter bounded)TestDefaultRetryConfigSafe(asserts worst-case budget ≤ 2 s)TestRequestWithRetry*(httptest: success-after-transient-500, no-retry-on-4xx, give-up-after-max-attempts, context-cancellation-stops-loop)timeout_applied=30sINFO upstream registeredlines on startup match production configWARN broadcast upstream returned erroronly fires on actual upstream errorOut of scope (follow-up)
head_block_timeupstream-lag detector with Prometheus gauge — needs a smallinternal/telemetrychange, will land in a separate PR.jussi-next(db=1) and legacy (db=0) soget_blockcache hits stop being polluted across versions (orchestration-side change).